Support Virtual-GenAI monitoring#13745
Conversation
|
Could you show the screenshow at least 3-4 mins data? One dot in only one minute is not very clear. |
There was a problem hiding this comment.
Review: Support Virtual-GenAI monitoring
Critical Issues
1. Config file exclude mismatch in server-starter/pom.xml
Exclude says gen-ai-settings.yml but actual file is gen-ai-config.yml.
2. requiredModules() returns empty array
GenAIAnalyzerModuleProvider uses CoreModule in start() but doesn't declare it in requiredModules(). Should return new String[] { CoreModule.NAME }.
3. Module naming convention violation
Existing analyzer modules use lowercase-hyphenated: agent-analyzer, log-analyzer, meter-analyzer. New module genAI-analyzer should be gen-ai-analyzer.
4. Package should be org.apache.skywalking.oap.server.analyzer.genai
Currently uses org.apache.skywalking.oap.meter.analyzer.* which collides with the existing meter-analyzer module's package. Since this is a trace span analyzer (shared across SkyWalking/OTEL/Zipkin receivers), the package should be org.apache.skywalking.oap.server.analyzer.genai.*.
Design Issues
5. Duplicate OAL metric
gen_ai_provider_resp_time and gen_ai_provider_latency_avg both compute from(GenAIProviderAccess.latency).longAvg(). Remove one.
6. totalCost semantics are confusing
Stored value is tokens * costPerM, dashboard divides by 1,000,000. Better to store actual cost by dividing at computation time.
7. Missing NamingControl in VirtualGenAIProcessor
Other virtual processors all use NamingControl to normalize service names. GenAI processor skips this.
8. Tag key inconsistency: gen_ai.stream.ttfr vs timeToFirstToken
Tag says "ttfr", field says "timeToFirstToken", doc doesn't mention this tag at all.
Code Quality Issues
9. GenAIConfigLoader constructor ignores Yaml parameter
Accepts Yaml but creates a new one in loadConfig().
10. fastjson dependency in e2e test
No new dependency version should be added directly in sub-module pom.xml.
Dependencies are managed by BOM. We have decided not to include this repo as it had a lot of critical CVEs before. We have to fix those(re-release patch version), it is too pain.
11. E2E Dockerfile clones unpinned external repo
Dockerfile.provider clones spring-projects/spring-ai-examples without pinning a commit/tag. Any upstream change could break the e2e test.
12. Documentation typo
virtual-genai.md: "Virtual cache represent the Generative AI service nodes" - copy-paste from virtual-cache doc.
Minor Issues
13. Missing newline at end of file in multiple files: gen-ai-config.yml, menu.yaml, SPI files, e2e expected YAMLs, dashboard JSONs.
14. GenAIModelAccessDispatcher bypasses normal dispatch flow - directly calls MetricsStreamProcessor.getInstance().in(traffic).
15. VirtualGenAIProcessor.recordList should be final.
16. Blank line in import block in VirtualServiceAnalysisListener.java between java.util and lombok imports.
There was a problem hiding this comment.
Additional issue: should use percentile2 instead of percentile
All production OAL files use percentile2(10). The old percentile function only exists in e2e test OAL for backward-compatibility testing.
In virtual-gen-ai.oal, the following lines should use percentile2:
gen_ai_provider_latency_percentile = from(GenAIProviderAccess.latency).percentile2(10);
gen_ai_model_latency_percentile = from(GenAIModelAccess.latency).percentile2(10);
gen_ai_model_ttft_percentile = from(GenAIModelAccess.timeToFirstToken).filter(timeToFirstToken > 0).percentile2(10);
And your UI doesn't show the correct percentile labels.
|
@wu-sheng |
|
UI side got merged. When you update this PR, please include the submodule update. |
|
not yet finish, some check fails in my local env, still fixing |
| @@ -0,0 +1,16 @@ | |||
| # Virtual GenAI | |||
There was a problem hiding this comment.
You need to update the demo to point to here. I think from Marketplace/General Service?
There was a problem hiding this comment.
Not just this. menu.yml is not updated in the /docs/en
|
e2e fails, please fix it. |
|
@wu-sheng |
…ng' of github.com:peachisai/skywalking into Support-GenAI-monitoring
|
ui submodule cannot push successfully, always loading . will try again. |
...c/main/java/org/apache/skywalking/oap/analyzer/genai/service/GenAIModelAccessDispatcher.java
Outdated
Show resolved
Hide resolved
…lking into Support-GenAI-monitoring
|
ui submodule had updated |
|
You have build errors. |
works on my side. could u retrigger the workflow again? it should be an occasional issue. |
| } | ||
|
|
||
| @Override | ||
| public GenAIMetrics extractMetricsFromSWSpan(SpanObject span, SegmentObject segment) { |
There was a problem hiding this comment.
Let's prepare a UT for this. As I noticed, the config file seems to have something missed, we should verify it is processed correctly.
There was a problem hiding this comment.
This UT should include loading files, matching rules, and estimated cost.
...lyzer/src/main/java/org/apache/skywalking/oap/analyzer/genai/service/GenAIMeterAnalyzer.java
Outdated
Show resolved
Hide resolved
...n-ai-analyzer/src/main/java/org/apache/skywalking/oap/analyzer/genai/config/GenAITagKey.java
Outdated
Show resolved
Hide resolved
Code style should be fixed. |
oap-server/server-starter/src/main/resources/oal/virtual-gen-ai.oal
Outdated
Show resolved
Hide resolved
...er/src/test/java/org/apache/skywalking/oap/server/starter/config/GenAIMeterAnalyzerTest.java
Outdated
Show resolved
Hide resolved
...lking/oap/server/analyzer/provider/trace/parser/listener/vservice/VirtualGenAIProcessor.java
Show resolved
Hide resolved
...lyzer/src/main/java/org/apache/skywalking/oap/analyzer/genai/service/GenAIMeterAnalyzer.java
Outdated
Show resolved
Hide resolved
| gen_ai_model_total_estimated_cost = from(GenAIModelAccess.totalEstimatedCost).sum(); | ||
| gen_ai_model_avg_estimated_cost = from(GenAIModelAccess.totalEstimatedCost).doubleAvg(); No newline at end of file |
There was a problem hiding this comment.
doubleAvg() on a long field
gen_ai_provider_avg_estimated_cost = from(GenAIProviderAccess.totalEstimatedCost).doubleAvg();
gen_ai_model_avg_estimated_cost = from(GenAIModelAccess.totalEstimatedCost).doubleAvg();
totalEstimatedCost is long in the source classes, but doubleAvg() is designed for double inputs. Verify this compiles and works correctly at OAL code generation time — if the
generated code expects getXxx() returning double but gets long, there may be a type mismatch. Consider either changing the field to double or using longAvg() and adjusting the
dashboard expression.
| String providerName; | ||
| } | ||
|
|
||
| public static class MatchResult { |
There was a problem hiding this comment.
| public static class MatchResult { | |
| @Data | |
| public static class MatchResult { |
Use Lombok?
| final Map<Character, TrieNode> children = new HashMap<>(); | ||
| String providerName; |
There was a problem hiding this comment.
Private fields? and with @Data?
| * | ||
| */ | ||
|
|
||
| package org.apache.skywalking.oap.server.starter.config; |
There was a problem hiding this comment.
Let's move this into Analyzer, and you could copy gen-ai-config.yml file into test in that module as well.
| - provider: groq | ||
| prefix-match: | ||
| - llama |
There was a problem hiding this comment.
We should not use all llama, as a OSS model, it could be used by anyone. We could simply remove this part.
wu-sheng
left a comment
There was a problem hiding this comment.
LGTM. Thanks for making this all fixed.

CHANGESlog.